Skip to content

feat: complete Arrow type mapping for parametric and missing types#16

Merged
eddietejeda merged 7 commits into
mainfrom
feat/complete-arrow-type-mapping
May 26, 2026
Merged

feat: complete Arrow type mapping for parametric and missing types#16
eddietejeda merged 7 commits into
mainfrom
feat/complete-arrow-type-mapping

Conversation

@eddietejeda

@eddietejeda eddietejeda commented May 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fixes `int8` being silently mapped to `Int64` (Postgres `int8` means 8-byte/64-bit; Arrow `int8` means 8-bit)
  • Adds `halffloat`, `large_string`, and all signed/unsigned integer variants to the simple alias map
  • Adds `_pa_type_from_arrow_str()` using PyArrow as the authoritative type system — parametric Arrow types (`timestamp[unit, tz=...]`, `duration[unit]`, `decimal128(p, s)` / `decimal256`, `list<item: T>` / `large_list<item: T>`) are constructed as `pa.DataType` instances and bridged to Ibis via `PyArrowType.to_ibis()`
  • Handles non-nullable list item fields (`not null` suffix stripped before recursive resolution)
  • Fixes decimal regex: `decimal128?` → `decimal(?:128|256)?` so `decimal12(...)` is not matched
  • Removes `managed.py` (single-constant module); inlines `"public"` at both call sites
  • Code quality: replaces walrus-then-overwrite in `_to_catalog_db_tuple`, uses `hasattr` instead of `getattr(...) is not None`, replaces magic HTTP status codes with `http.HTTPStatus` constants, iterates schema fields directly instead of `range(len(columns))`
  • Updates README: full `connect()` parameter documentation, URL form, schema-only `create_table`, `force=True` on drops, `to_pyarrow_batches()` behavior note, updated feature table

Before this change, Parquet-loaded tables with parametric Arrow column types would silently fall back to `String` or `Unknown`, losing type information.

Test plan

  • `uv run pytest tests/` — 98 tests pass (33 new cases added)
  • New parametrized tests cover: all 4 timestamp units, timezone variants (`UTC`, `America/New_York`), scale values per unit, all 4 duration units, unknown duration unit falls back (not `Interval`), decimal precision/scale (`decimal128`, `decimal256`, bare `decimal`), `decimal12` not matched, list with various element types and non-nullable items, case-insensitivity throughout

Add support for all common Arrow type strings returned by Hotdata's
information_schema when tables are loaded from Parquet/Arrow sources.

Simple additions to _ARROW_TYPE_MAP:
- int8 → dt.Int8 (Postgres parser wrongly returns Int64 for "int8")
- halffloat → dt.Float16 (PyArrow's str() name for float16
- large_string → dt.String (PyArrow large-offset string variant)

New _parse_parametric_arrow_type() for types with embedded parameters:
- timestamp[us] / timestamp[us, tz=UTC] → dt.Timestamp(timezone=...)
- duration[ms] → dt.Interval(unit='ms')
- decimal128(10, 3) / decimal(5, 2) → dt.Decimal(precision, scale)
- list<item: T> / large_list<item: T> → dt.Array(T) (recursive)

Adds 27 new test cases covering all new patterns including timezone
variants, all 4 time units, case-insensitivity, and recursive list
element type resolution.
EOF
)
Comment thread src/ibis_hotdata/types.py Outdated
Comment thread src/ibis_hotdata/types.py Outdated
Comment thread src/ibis_hotdata/types.py Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Blocking Issues

  • src/ibis_hotdata/types.py:63-66_TIMESTAMP_RE captures the Arrow time unit in group 1 but the handler discards it, so every parametric timestamp (timestamp[s], timestamp[ms], timestamp[us], timestamp[ns]) is constructed as dt.Timestamp() with default scale. dt.Timestamp is a parametric type — different scale values produce different types — so the unit information this PR claims to recover is silently lost. The existing tests pass only because they don't assert on scale.

Action Required

  • Map the captured unit to an Ibis scale (s→0, ms→3, us→6, ns→9) and pass it to dt.Timestamp(...).
  • Extend test_dtype_from_hotdata_arrow_timestamp to assert the resulting scale so this regression is caught.

See inline comments for the suggested patch and two non-blocking nits (decimal regex shape, silent unknown-unit fallback).

Two bugs found by Codex review:

1. Timestamp scale was discarded: Arrow timestamp[ms] should map to
   dt.Timestamp(scale=3), not dt.Timestamp(). Add unit-to-scale mapping
   s=0, ms=3, us=6, ns=9, matching PyArrow convention.

2. Non-nullable list items mis-parsed: PyArrow emits
   'list<item: int32 not null>' for non-nullable item fields. Strip the
   ' not null' suffix and pass nullable=False to the recursive call so
   the element type is correctly typed instead of falling back to Unknown.
claude[bot]
claude Bot previously approved these changes May 26, 2026
Two nits from PR review:

- decimal regex 'decimal128?' made only the trailing 8 optional, so it
  matched the non-existent 'decimal12' form. Changed to
  'decimal(?:128|256)?' to accept decimal, decimal128, or decimal256.

- _ARROW_UNIT_TO_IBIS.get(..., 's') silently mapped any unrecognised
  duration unit (e.g. duration[foo]) to Interval(s). Now returns None
  so the caller falls through to the Postgres parser / String fallback.
claude[bot]
claude Bot previously approved these changes May 26, 2026
Instead of constructing Ibis types directly (with hand-coded unit->scale
mappings etc.), resolve Arrow type strings to PyArrow types first and
convert via ibis.formats.pyarrow.PyArrowType.to_ibis().

- Replace _ARROW_TYPE_MAP (str -> Ibis class) with _PA_TYPE_MAP
  (str -> pa.DataType instance), covering all scalar types that can
  appear as list element types
- Merge _parse_parametric_arrow_type() into _pa_type_from_arrow_str()
  which returns a pa.DataType or None
- Remove _ARROW_UNIT_TO_IBIS and _ARROW_UNIT_TO_TIMESTAMP_SCALE —
  PyArrow encodes unit/scale semantics natively
- Fix decimal256: pa.decimal128 rejects precision > 38; fall back to
  pa.decimal256 for wider values
- dtype_from_hotdata_sql_type() simplified to: try Arrow path via
  PyArrow bridge, then Postgres parser, then String fallback
claude[bot]
claude Bot previously approved these changes May 26, 2026
- backend.py: replace walrus-then-overwrite pattern with plain assignments
  in _to_catalog_db_tuple
- backend.py: getattr(self, '_http', None) is not None -> hasattr(self, '_http')
- backend.py: add explanatory comments to the silent pass in
  _resolve_database_connection_id and the no-op _register_in_memory_table
- http.py: replace magic HTTP status integers with http.HTTPStatus constants
- http.py: replace range(len(columns)) index loop with direct field iteration
- types.py: cache m.group(2) to avoid calling it twice
- Remove managed.py (single-constant module); inline 'public' directly at
  the two call sites in backend.py and http.py
claude[bot]
claude Bot previously approved these changes May 26, 2026
- Correct ibis-framework version requirement to >=10,<11
- Document all connect() optional parameters with inline comments
- Add URL query string example with optional parameters
- Document schema-only create_table (empty table from schema)
- Document force=True on drop operations
- Note to_pyarrow_batches() downloads full result then splits locally
- Add in-memory tables (unsupported) and Arrow type mapping to feature table
claude[bot]
claude Bot previously approved these changes May 26, 2026
- Change default schema public to main to match runtimedb DEFAULT_SCHEMA_NAME;
  runtimedb auto-inserts main into every managed database, so using public
  silently created a spurious empty main schema alongside the declared one
- Trim _IN_FLIGHT to {running} — runtimedb QueryRunStatus only emits
  running, succeeded, and failed; queued and pending are result statuses
- Update README examples to use main schema throughout

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior review feedback addressed:

  • Timestamp scale now preserved via the PyArrow→Ibis bridge (pa.timestamp(unit, tz=tz) + PyArrowType.to_ibis()), with tests asserting out.scale per unit.
  • Decimal regex tightened to decimal(?:128|256)?, with a regression test for decimal12.
  • Unknown duration units now return None (via pa.duration raising) and fall through instead of silently mapping to seconds, with a regression test.

@eddietejeda eddietejeda merged commit 1a00ec5 into main May 26, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant